Skip to content

[FLINK-24544][formats] Fix Avro enum deserialization failure with Confluent Schema Registry#28488

Open
rodriguezc wants to merge 1 commit into
apache:masterfrom
rodriguezc:fix/FLINK-24544-avro-enum-deserialization
Open

[FLINK-24544][formats] Fix Avro enum deserialization failure with Confluent Schema Registry#28488
rodriguezc wants to merge 1 commit into
apache:masterfrom
rodriguezc:fix/FLINK-24544-avro-enum-deserialization

Conversation

@rodriguezc

@rodriguezc rodriguezc commented Jun 19, 2026

Copy link
Copy Markdown

References

This implementation was inspired by PR #27591(FLINK-24544), which addressed schema compatibility between writer and reader schemas. The current fix adapts the approach specifically for enum/string type mismatches while preserving field projection capabilities.

What is the purpose of the change

This pull request fixes a deserialization failure when consuming Avro records with enum fields from Confluent Schema Registry. When the writer schema (from Schema Registry) declares a field as an ENUM type but the reader schema (from Flink DDL) declares the same field as STRING, the GenericDatumReader fails during deserialization because it expects to find an enum in the data but the reader schema instructs it to produce a string output.

The fix introduces a schema merging strategy in RegistryAvroDeserializationSchema.deserialize() that creates a hybrid "expected schema":

  • From the reader schema: Preserves the field list and field order (enabling field projection - reading only a subset of fields)
  • From the writer schema: Substitutes ENUM types for fields where the reader declared STRING but the writer has ENUM

This allows GenericDatumReader to correctly deserialize enum values as GenericEnumSymbol objects, which are then converted to StringData by AvroToRowDataConverters via the .toString() method.

Brief change log

  • Added mergeSchemaTypes() method in RegistryAvroDeserializationSchema to create hybrid reader/writer schemas
  • Schema merge specifically handles enum/string type mismatches by preferring the writer's enum type
  • Schema merge preserves reader schema field structure to maintain field projection support
  • Modified AvroToRowDataConverters to use name-based field access for GenericRecord (instead of positional)
  • Added comprehensive test coverage for enum deserialization with various scenarios

Verifying this change

This change added tests and can be verified as follows:

  • Added RegistryAvroDeserializationSchemaTest.testNestedRecordWithEnumField() that validates:
    • Enum to string conversion works correctly
    • Nullable enum fields are handled properly
    • Field projection (subset of fields) works with enum types
    • Nested records with enum fields are deserialized correctly
    • Note: The test schema structure is based on real-world CDC from Qlik Replicate, which commonly produces nested records with enum fields (e.g., CDC metadata like operation enum types)
  • Existing tests continue to pass, ensuring no regression in field projection functionality

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: yes (Avro deserialization logic)
  • The runtime per-record code paths (performance sensitive): yes (adds schema merging on each deserialization)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable (bug fix)
Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    Generated-by: Claude Code (Claude Sonnet 4.5)

@flinkbot

flinkbot commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants